Abstract: Symbolic Aggregation Approximation (SAX) has been the
de facto standard representation method for knowledge discovery
in time series across a number of tasks and applications. So far,
very little work has empirically investigated the intrinsic
properties and statistical mechanics of SAX words. In this paper,
we apply several statistical measurements and propose a new one,
the Information Embedding Cost (IEC), to analyze the statistical
behavior of the symbolic dynamics. Our experiments on benchmark
datasets and clinical signals demonstrate that SAX consistently
reduces the complexity while preserving the core information
embedded in the original time series with significant embedding
efficiency. The proposed IEC score provides an a priori criterion
to determine whether SAX is adequate for a specific dataset, and
it can be generalized to evaluate other symbolic representations.
Our work provides an analytical framework with several
statistical tools to analyze, evaluate and further improve
symbolic dynamics for knowledge discovery in time series.

Key Words: SAX; Knowledge Discovery in Time Series;
Information Embedding Cost; Permutation Entropy; Symbolic
Complexity

I. Introduction 

A time series is a sequence of data obtained from sequential
measurements over time. Nowadays, time series data are ubiquitous
in finance, health care, multimedia, agriculture, manufacturing
and other fields. The rising demand for time series data mining
has inspired a number of representation methods that reduce the
volume, smooth the noise or extract the implicit structure for
further mining tasks. Numerical approaches inspired by signal
processing techniques were applied first. The Discrete Fourier
Transformation (DFT) framework is applied in the seminal work of
Agrawal et al. [1], which projects the time series onto sine and
cosine bases. The Discrete Wavelet Transformation (DWT), which
uses scaled variants of a mother wavelet function as a
multiresolution decomposition basis to capture both
high-frequency and low-frequency content over large intervals,
has been applied intensively in the literature over the last
decades [2], [3], [4]. Singular Value Decomposition (SVD) and its
extensions have been proposed to find multiscale patterns in
streaming data [5], [6]. Esling and Agon give a good review of
other numerical representation methods [7].

Instead of generating a sequence of numerical outputs, symbolic
representations provide another perspective. Aligned Cluster
Analysis (ACA) was introduced as an unsupervised method to
cluster the temporal patterns of human motion data [8]. It is an
extension of kernel k-means clustering but is computationally
demanding. Persist is an unsupervised discretization method that
maximizes the persistence measurement of each symbol [9].
Piecewise Aggregation Approximation (PAA) was proposed by Keogh
et al. [10] to reduce the dimensionality of time series, and was
later extended to Symbolic Aggregation Approximation (SAX) [11].
In SAX, each aggregated value produced by PAA is mapped to one of
a set of equiprobable intervals of the standard normal
distribution to produce a sequence of symbols. Sant'Anna and
Wickstrom compared the overall performance of the above three
symbolic approaches [12].

Among the above representation approaches, SAX has become one of
the de facto standards for discretizing time series and is at the
core of many effective classification algorithms. Keogh et al.
introduced the problem of finding time series discords and
applied SAX derivations to find the subsequences of a longer time
series that are maximally different from all other subsequences
[13]. A SAX-based algorithm has been proposed to discover time
series motifs with invariance to uniform scaling; its authors
show that the approach produces objectively superior results in
several important domains [14]. The Vector Space Model has been
combined with SAX as a novel method to discover characteristic
patterns in a time series [15]. SAX with bag-of-words
representations has proved to be powerful in several time series
classification tasks [16], [17], [18].

SAX has served as a gold standard among supervised parametric
approaches for more than ten years, yet what is its intrinsic
statistical behavior? How can we evaluate and justify its
performance through empirical studies? What can we learn from SAX
to further improve time series representations? The following
properties and questions primarily motivate our work:

• Data complexity: Time series data are inherently noisy
and chaotic. Symbolization can improve the analysis under
the assumption that it compresses the complexity and
generates more concise representations.

• Information loss: The PAA step of SAX reduces the volume
but always incurs information loss. Ideal results are
achieved when the lost information is superfluous, e.g.
when only noise is discarded from the raw time series.

• Efficiency of information embedding: While SAX is
supposed to smooth the noise, a natural question is how
much of the useful information in the original signal is
carried over into the symbolic output of SAX.

• Seasonality and correlation: Correlation and seasonality
are important properties in time series analysis, e.g. for
determining the order of an ARIMA model. For pattern
discovery tasks, explicitly extracting the implicit
seasonality of a time series helps to reveal its intrinsic
correlations. However, redundant correlation reduces the
representation efficiency.

We empirically study the above four aspects from a statistical
perspective. Briefly, permutation entropy [19] helps us to reveal
the implicit complexity beneath seeming randomness or chaos. KL
divergence [20] measures the distance between the distribution of
the symbolic output and that of the raw data. The information
reduction/loss is quantified by the standard reconstruction
information loss and by a new measurement, the Information
Embedding Cost (IEC), which is built upon the standard
reconstruction information loss and the KL divergence and is
applied to analyze the information embedding efficiency. We also
consider the autocorrelation function (ACF) and the partial
autocorrelation function (PACF) [21] to better understand the
seasonality and internal correlation embedded in the raw data and
the symbolic outputs. We analyze these properties for SAX, PAA
and raw data on seven benchmark datasets [22] and on real-world
clinical signals. The proposed analysis framework yields
consistent and interpretable conclusions.

II. BACKGROUND 

In this section, we briefly introduce the SAX approach. For the
purposes of this paper, SAX can be divided into two phases. First,
the dimensionality of the input time series is reduced by
Piecewise Aggregation Approximation (PAA): each PAA value is the
mean of its corresponding window. Then the PAA values are mapped
to equiprobable intervals of the standard normal distribution,
each labeled with an alphabet character. The overall trend of the
time series is thereby extracted as a symbolic sequence, the SAX
word.

The algorithm has three parameters: window length n, word number
w and alphabet size a. Different parameters lead to different
representations of the time series. Given a normalized time
series of length n, the first step reduces the dimensionality by
dividing it into w non-overlapping sub-windows of length [n/w].
The mean value of each sub-window is computed to reduce the
volume and smooth the signal (PAA procedure). Then the PAA values
are mapped onto the standard normal density N(0,1), whose range
is divided into a equiprobable segments. Letters starting from A
(up to Z) are assigned to the PAA values according to the
segments they fall in. Fig. 1 shows the PAA values and the SAX
word of an ECG signal from the UCR dataset. A time series of
length 96 is partitioned into 5 segments, and the mean of each
segment is allocated to its equiprobable interval. After
discretization by PAA and symbolization by SAX, the time series
is converted into the symbolic sequence CACBB.

Fig. 1. The piecewise aggregate approximation for the "ECG" data and the
corresponding SAX words.
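The two phases described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference SAX implementation: the function names (`znormalize`, `paa`, `gaussian_breakpoints`, `sax`) are ours, segment boundaries are placed by integer division when the length is not divisible by w, and the breakpoints are taken from the inverse CDF of N(0,1) in the Python standard library.

```python
from math import sqrt
from statistics import NormalDist
import string


def znormalize(series):
    """Scale a series to zero mean and unit (population) variance."""
    n = len(series)
    mu = sum(series) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in series) / n)
    if sigma == 0:
        return [0.0] * n
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Mean of each of w non-overlapping segments (assumes 1 <= w <= len)."""
    n = len(series)
    bounds = [i * n // w for i in range(w + 1)]  # assumption: floor boundaries
    return [
        sum(series[bounds[i]:bounds[i + 1]]) / (bounds[i + 1] - bounds[i])
        for i in range(w)
    ]


def gaussian_breakpoints(a):
    """The a - 1 cut points that split N(0,1) into a equiprobable intervals."""
    normal = NormalDist()
    return [normal.inv_cdf(i / a) for i in range(1, a)]


def sax(series, w, a):
    """SAX word of length w over the first a letters of the alphabet."""
    if not 2 <= a <= 26:
        raise ValueError("alphabet size must be between 2 and 26")
    cuts = gaussian_breakpoints(a)
    word = []
    for value in paa(znormalize(series), w):
        index = sum(value > c for c in cuts)  # number of cut points below value
        word.append(string.ascii_uppercase[index])
    return "".join(word)


print(sax(list(range(9)), w=3, a=3))
```

A steadily rising series falls into the lowest, middle and highest interval in turn, so the three PAA values receive three different letters in increasing order.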

III. METHODS 
A. Permutation Entropy (PE) 

Permutation entropy (PE) is a measurement in symbolic dynamics
that quantifies the complexity of a time series. Given a time
series $\{x_t\}_{t=1,\dots,T}$, PE considers all permutations
$\pi$ of order $n$. The set of all possible orderings of $n$
different numbers is denoted $S_n$. For each $\pi \in S_n$, we
determine the relative frequency of the permutation pattern as:

$$p(\pi) = \frac{\#\{t \mid 0 \le t \le T-n,\ (x_{t+1}, x_{t+2}, \dots, x_{t+n}) \in \pi\}}{T-n+1} \qquad (1)$$

When the order $n \ge 2$, the PE value is defined as

$$H(n) = -\sum_{\pi \in S_n} p(\pi) \log p(\pi) \qquad (2)$$

where the sum runs over all $n!$ permutations in the permutation
pool $S_n$. It is clear that $0 \le H(n) \le \log n!$. We use the
normalized PE, $H(n)/\log n!$, which scales the PE value into
[0,1] to facilitate comparison.

PE has two parameters. The permutation order n determines the
length of the compared sequences, i.e. the size of the
permutation pool. The time delay t controls the spacing between
successive points in each compared sequence. When PE is used to
determine optimal embedding parameters, the optimal n and t
depend strongly on the system [?]. Rather than searching for a
set of optimal parameters that precisely quantifies the
complexity, we compare the relative complexity of raw data, PAA
values and SAX words. For simplicity, we let t range from 1 to
100 to observe how the PE value changes with the time delay. For
the parameter n, we follow the suggestion to constrain the series
length to T > 5n! [?], thus n ≤ 7. Equal values in a sequence
share the same rank and thereby generate a new permutation
pattern; e.g. the sequence 3, 4, 4, 3, 1 leads to the permutation
pattern 1, 2, 2, 1, 0 [?].

Fig. 2. Reconstructed signals of the SAX word and PAA representation (left)
and MSE over all samples (right) on the UCR Coffee dataset.
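The permutation entropy with order n and time delay t, including the tie rule illustrated by 3, 4, 4, 3, 1, can be sketched as follows. The function names are ours, and the dense-rank treatment of ties is our reading of that example, so this is an illustration rather than the exact procedure used in the experiments.

```python
from collections import Counter
from math import factorial, log


def ordinal_pattern(window):
    """Dense ranks of a window; equal values share the same rank."""
    ranks = {value: rank for rank, value in enumerate(sorted(set(window)))}
    return tuple(ranks[value] for value in window)


def permutation_entropy(series, n, delay=1, normalize=True):
    """Entropy of the ordinal patterns of order n taken with the given delay."""
    span = (n - 1) * delay
    count = len(series) - span
    if n < 2 or count <= 0:
        raise ValueError("need n >= 2 and a series longer than (n - 1) * delay")
    patterns = Counter(
        ordinal_pattern(series[i:i + span + 1:delay]) for i in range(count)
    )
    entropy = -sum((c / count) * log(c / count) for c in patterns.values())
    # Note: tie patterns are counted in addition to the n! strict orderings,
    # so with many ties the normalized value can slightly exceed 1.
    return entropy / log(factorial(n)) if normalize else entropy


print(ordinal_pattern([3, 4, 4, 3, 1]))             # (1, 2, 2, 1, 0)
print(permutation_entropy(list(range(20)), n=3))    # monotone series: zero entropy
```

A monotone series produces a single pattern and therefore zero entropy, while a series that alternates up and down with n = 2 uses both patterns equally often and reaches the normalized maximum of 1.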

B. Information Embedding Cost (IEC)

1) Information loss: Discretization always incurs information
loss. Piecewise averaging by PAA and SAX reduces the
dimensionality, but also discards fine-grained (high-frequency)
details of the raw time series. Information loss is estimated by
the Mean Square Error (MSE) between the original signal and the
signal reconstructed from the symbolic sequence, which is
slightly different from the definition in [?]:

$$\mathrm{InfoLoss}(\hat{T}, T) = \frac{1}{n} \sum_{i=1}^{n} (t_i - \hat{t}_i)^2, \quad t_i \in T,\ \hat{t}_i \in \hat{T} \qquad (3)$$

$T$ is the original signal and $\hat{T}$ is the reconstructed
signal. For PAA, the reconstructed signal is generated by
substituting each original sample with its corresponding PAA
value. For SAX, the piecewise averages of the normalized signal
are mapped to their equiprobable intervals for symbolization, and
each original sample is then substituted by its corresponding SAX
symbol. To facilitate comparison and analysis, the raw time
series, PAA values and SAX words (A-Z correspond to digits 0-25)
are scaled to [0,1] (Fig. 2).
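The reconstruction step and the MSE in (3) can be illustrated with a small sketch. The helper names are ours, and two details are assumptions rather than statements from the text: each sequence is min-max scaled to [0,1] on its own, and segment boundaries follow integer division.

```python
def minmax_scale(values):
    """Scale values linearly to [0, 1]; a constant sequence maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def expand_segments(segment_values, n):
    """Repeat each segment value so the reconstruction has n samples."""
    w = len(segment_values)
    out = []
    for i, value in enumerate(segment_values):
        out.extend([value] * ((i + 1) * n // w - i * n // w))
    return out


def symbols_to_digits(word):
    """Map letters A-Z to digits 0-25, as described for SAX words."""
    return [ord(ch) - ord("A") for ch in word]


def information_loss(original, segment_values):
    """MSE between the scaled original and the scaled reconstruction."""
    t = minmax_scale(original)
    t_hat = minmax_scale(expand_segments(segment_values, len(original)))
    return sum((a - b) ** 2 for a, b in zip(t, t_hat)) / len(t)


raw = [0, 1, 2, 3]
print(information_loss(raw, [0.5, 2.5]))               # PAA values of two segments
print(information_loss(raw, symbols_to_digits("AC")))  # a two-letter SAX word
```

In this toy case the PAA values and the SAX word scale to the same step function, so both reconstructions incur the same loss; on real data the Gaussian mapping generally moves the SAX levels away from the PAA means.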

2) KL divergence: The KL divergence measures the distance
between two probability distributions. For distributions P and Q
over k points, the non-symmetric KL divergence is defined as:

$$KL(P \| Q) = \sum_{i=1}^{k} p_i \log \frac{p_i}{q_i}, \quad p_i \in P,\ q_i \in Q \qquad (4)$$
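A direct transcription of (4) shows the asymmetry of the measure. The function name is ours, and the use of base-2 logarithms (so that the result is in bits) is an assumption.

```python
from math import log2


def kl_divergence(p, q):
    """KL(P || Q) in bits for two discrete distributions over the same k bins."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue              # a bin with p_i = 0 contributes nothing
        if q_i == 0:
            return float("inf")   # P puts mass where Q has none
        total += p_i * log2(p_i / q_i)
    return total


p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
```

Because the two directions give different values, it matters which of the two distributions (original signal or encoded output) is placed in the role of P.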


Information loss measures the amount of information discarded
when coding the original signal into the symbolic output. KL
divergence indicates the distance between the distribution of the
encoded output and that of the original signal. We define the IEC
score as the add-one ratio of the standardized KL divergence to
the standardized information loss. IEC measures how much useful
information is discarded when approximating the original signal:

$$IEC_T(P, Q) = \frac{KL_T(P \| Q)_{std}}{1 + \mathrm{InfoLoss}(\hat{T}, T)_{std}} \qquad (5)$$

Given a time series $T$ with distribution $Q$ and its encoded
output $\hat{T}$ with distribution $P$, the IEC score expresses
the number of extra bits required to encode the output $\hat{T}$
per unit of information loss incurred. Note that the IEC
framework does not weight KL divergence and information loss
equally. The IEC score decreases linearly as the KL divergence
falls. When the information loss increases, however, the change
in the IEC score follows the inverse-square term
$(1 + \mathrm{InfoLoss}(\hat{T}, T))^{-2}$ and also depends on the
value of the KL divergence, since
$\partial IEC / \partial \mathrm{InfoLoss} = -KL / (1 + \mathrm{InfoLoss})^{2}$.
This means that the IEC score puts more emphasis on the KL
divergence, i.e. on the information embedding accuracy, than on
the compression rate of the symbolic representation. Intuitively,
this is reasonable because we need precise representations rather
than symbols that are too concise to preserve enough information
from the original signal.
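The score in (5) and its sensitivity to the information loss can be checked numerically. This sketch assumes that both inputs have already been standardized to [0,1]; how the standardization is carried out is not spelled out here, so the min-max helper below is only one possible choice, and all names are ours.

```python
def standardize(values):
    """Assumed standardization: min-max scaling of a list of scores to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def iec_score(kl_std, info_loss_std):
    """Add-one ratio of standardized KL divergence to standardized loss."""
    if not (0.0 <= kl_std <= 1.0 and 0.0 <= info_loss_std <= 1.0):
        raise ValueError("both inputs are expected in [0, 1]")
    return kl_std / (1.0 + info_loss_std)


def iec_slope_in_loss(kl_std, info_loss_std):
    """Partial derivative of the IEC score with respect to the loss."""
    return -kl_std / (1.0 + info_loss_std) ** 2


for kl, loss in [(0.0, 0.7), (1.0, 0.0), (1.0, 1.0), (0.2, 0.8)]:
    print(kl, loss, iec_score(kl, loss), iec_slope_in_loss(kl, loss))
```

The score is linear in the KL divergence, whereas the effect of the information loss flattens as the loss grows and vanishes when the KL divergence is zero, which is the unequal weighting described above.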

IEC = 0 when the encoding approach generates a representation
with the same probability distribution as the original signal.
IEC = 1 if and only if the deviation measured by the KL
divergence reaches its maximum while the information loss is
zero, i.e. no bit of the original signal is compressed by the
symbolic representation. IEC = 0.5 when the standardized KL
divergence and information loss are both 1. This is the baseline
when applying the IEC score to evaluate the information embedding
efficiency of a specific symbolic representation: IEC = 0.5 means
that the symbolic representation loses one unit of information
while reducing one unit of complexity. A relatively large
information loss combined with a small KL divergence indicates
that the encoding method preserves much information while
reducing the information complexity. Thus, an encoding method
with a small IEC score is more likely to produce a concise and
accurate representation.


Regarding the KL divergence in (4): in information theory, it
measures the expected number of extra bits required to code
samples from P when using a code based on Q rather than the
original code based on P itself. In our experiments, P is the
distribution of quantile bins in the original signals; the number
of bins equals the alphabet size a in the SAX/PAA settings.

C. (Partial) Autocorrelation Coefficient Function

For a stationary process $Z = Z_1, Z_2, \dots, Z_t$, the
covariance between $Z_t$ and $Z_{t+k}$ is defined as

$$\gamma_k = \mathrm{cov}(Z_t, Z_{t+k}) = E[(Z_t - \mu)(Z_{t+k} - \mu)] \qquad (6)$$

The correlation between $Z_t$ and $Z_{t+k}$ is:

$$\rho_k = \frac{\mathrm{cov}(Z_t, Z_{t+k})}{\sqrt{\mathrm{Var}(Z_t)}\sqrt{\mathrm{Var}(Z_{t+k})}} = \frac{\gamma_k}{\gamma_0} \qquad (7)$$

As a function of k, $\rho_k$ is the Autocorrelation Coefficient
Function (ACF). The ACF represents the covariance and correlation
between $Z_t$ and $Z_{t+k}$ at time lag k.




























TABLE I
Optimal SAX parameters on seven benchmark datasets.

Dataset      Length n   w (rounded)   a (rounded)
ECG          96         12            7
Lighting2    637        18            7
Coffee       286        48            7
Adiac        176        25            9
Lighting7    319        11            9
Beef         470        11            5
Oliveoil     570        26            7


If we remove the mutual linear dependency on the intervening
variables $Z_{t+1}, Z_{t+2}, \dots, Z_{t+k-1}$, the conditional
correlation is described by:

$$\mathrm{Corr}(Z_t, Z_{t+k} \mid Z_{t+1}, Z_{t+2}, \dots, Z_{t+k-1}) \qquad (8)$$

The above expression is the Partial Autocorrelation Coefficient
Function (PACF).

ACF and PACF are widely applied in time series analysis to
identify the order of an ARIMA model. We focus on the essential
property revealed by ACF and PACF, i.e. the seasonality of the
time series. A symbolic representation reduces the complexity of
the original signal while running the risk of information loss
and information distortion. Regular patterns such as seasonality
extracted from noisy and chaotic data are important for
discovering meaningful patterns embedded in the original data,
but over-complicated correlation leads to large information
redundancy. The ideal representation removes much of the
superfluous correlation while retaining the main invariant
dependencies.
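The sample versions of the ACF and PACF can be sketched as follows. The function names are ours. The PACF is computed here with the Durbin-Levinson recursion, which is one standard estimator and not necessarily the one used in the experiments, and `mean_abs_acf` assumes that the "absolute mean" of the ACF means the average of |rho_k| over lags 1 to K.

```python
def acf(series, max_lag):
    """Sample autocorrelation rho_0 .. rho_max_lag (biased covariance estimate)."""
    n = len(series)
    mu = sum(series) / n
    dev = [x - mu for x in series]
    gamma0 = sum(d * d for d in dev) / n
    if gamma0 == 0:
        raise ValueError("constant series has no defined autocorrelation")
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / n / gamma0
        for k in range(max_lag + 1)
    ]


def pacf(series, max_lag):
    """Partial autocorrelations for lags 1 .. max_lag via Durbin-Levinson."""
    rho = acf(series, max_lag)
    previous = []  # phi_{k-1,1} .. phi_{k-1,k-1}
    result = []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
        else:
            num = rho[k] - sum(previous[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(previous[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
        previous = [
            previous[j] - phi_kk * previous[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        result.append(phi_kk)
    return result


def mean_abs_acf(series, max_lag):
    """Average of |rho_k| over lags 1 .. max_lag (assumed definition)."""
    return sum(abs(r) for r in acf(series, max_lag)[1:]) / max_lag


alternating = [1, -1] * 4
print(acf(alternating, 2), pacf(alternating, 2), mean_abs_acf(alternating, 2))
```

For the alternating toy series the ACF stays large at every lag, while the PACF drops sharply after lag 1: once the lag-1 dependence is removed, little conditional correlation remains.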

IV. EXPERIMENTS AND ANALYSIS 

We select 7 benchmark datasets from the UCR Time Series
Classification and Clustering Repository [22]. We also consider
real clinical signals to validate our results. Our clinical data
were collected at the University of XXX School of Medicine. All
patient data are anonymized to protect privacy. ECG and PPG
recordings of 556 patients were collected, each 68 to 128 minutes
long at a 240 Hz sampling rate. SAX requires three parameters:
window length n, word number w and alphabet size a. To
empirically explore the intrinsic properties of appropriate SAX
representations, we first use SAX-BoP representations with a 1NN
classifier to classify the benchmark datasets, as described in
[?], and take the top 10 representations for each dataset. For
simplicity, the window length n is the sequence length, while the
word number w and alphabet size a are the rounded values of their
averages over the top 10 representations. All the parameters are
shown in Table I.
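The parameter selection described above amounts to averaging w and a over the best-performing representations and rounding. A small sketch follows; the candidate tuples are hypothetical, and both the half-up rounding rule and the function names are our assumptions.

```python
def round_half_up(x):
    """Round a non-negative number to the nearest integer, halves upward."""
    return int(x + 0.5)


def consensus_parameters(candidates, top=10):
    """candidates: (word_number, alphabet_size, error_rate) tuples.

    Returns the rounded mean w and a over the `top` lowest-error candidates.
    """
    best = sorted(candidates, key=lambda c: c[2])[:top]
    w = round_half_up(sum(c[0] for c in best) / len(best))
    a = round_half_up(sum(c[1] for c in best) / len(best))
    return w, a


# Hypothetical grid-search results, for illustration only.
results = [(10, 5, 0.10), (12, 7, 0.20), (14, 8, 0.30), (100, 20, 0.90)]
print(consensus_parameters(results, top=3))
```

Restricting the average to the lowest-error candidates keeps a single poor configuration, such as the last tuple here, from distorting the chosen w and a.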

A. Complexity 

To study the representational complexity of SAX, we consider the
normalized permutation entropy on the benchmark datasets. For
clarity and for the sake of space, we show the graphs for one
dataset only; the analysis nevertheless holds on all the
datasets. In Figure 3(a), the PE value clearly rises linearly
with increasing t, which demonstrates that the complexity of the
unfolding trajectory also increases: the correlation between the
values in the embedding vector spanning t x (n - 1) samples is
getting lost. This observation is named the 'redundancy effect'
by De Micco et al. [26]. High correlation means less need to
visit the phase of the data during the reconstruction of the
trajectory, so a rising PE value shows that the data are
correlated within the embedding vector of t x (n - 1). The
embedding vector is too small to summarize the full information
among the successive values of the original time series that
fall into it; the correlation, or the embedded information, in
the original time series is loose, with much noise and redundant
irrelevance. This assumption is supported by the observation
that, within the embedding vector t x (n - 1), the PE value
increases linearly without convergence in Figure 3(a). Relatively
high internal correlation leads to information redundancy in
high-dimensional time series and aggravates the curse of
dimensionality [?].

[Figure 3: four panels of PE versus log10(t), titled
"beef-150-7, n = 2-7, t = [1,150]", "beef-10-7-SAX, n = 2-7,
t = [1,10]", "UMBECG-1000-7-SAX, n = 2-7, t = [1,1000]" and
"UMBECG-300-7-biSAX, n = 2-7, t = [1,300]".]

Fig. 3. Permutation entropy on (a) raw UCR time series, permutation order
n = [2, 7], t is up to a fraction of the length (t = [1, 150]) (top left),
(b) SAX words from UCR time series, permutation order n = [2, 7],
t = [1, 10] (top right), (c) clinical signals (bottom left), (d) SAX words
on clinical signals (bottom right).
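The dependence of PE on the time delay can be reproduced on a synthetic signal. The sketch below uses a slow sine wave as a hypothetical stand-in for a strongly correlated series; the function name and the signal are ours, and ties are broken by position, which is a simplification.

```python
from collections import Counter
from math import factorial, log, pi, sin


def normalized_pe(series, n, delay):
    """Normalized permutation entropy; ties are broken by position."""
    count = len(series) - (n - 1) * delay
    patterns = Counter(
        tuple(sorted(range(n), key=lambda j: series[i + j * delay]))
        for i in range(count)
    )
    entropy = -sum(c / count * log(c / count) for c in patterns.values())
    return entropy / log(factorial(n))


# Hypothetical test signal: a sine wave with a period of 200 samples.
signal = [sin(2 * pi * t / 200) for t in range(1000)]
sweep = {delay: normalized_pe(signal, 3, delay) for delay in (1, 10, 25, 50)}
for delay, value in sweep.items():
    print(delay, round(value, 3))
```

At delay 1 almost every window is monotone, so few patterns occur and the entropy is low; at a delay of a quarter period the windows straddle peaks and troughs, more patterns appear and the entropy is markedly higher.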

After symbolization by SAX, the redundancy effect is
significantly alleviated (Figure 3(b)). There are two possible
reasons. First, because the symbolic sequence is shorter, the
number of permutation patterns becomes limited as t increases.
Second, SAX words not only decrease the dimensionality but also
make the correlation more 'compact'. Unlike data with very loose
correlations, the SAX words do not suffer from a heavy redundancy
effect before the PE value converges or starts to diffuse.

Moreover, the magnitude of the PE values reflects the absolute
complexity. Comparing (a) and (b) in Figure 3 (and on all
datasets), the magnitudes of the PE values of the raw data and of
the SAX words overlap. However, the benchmark time series are not
very noisy and their information is compact. We therefore also
compared the PE values of the raw clinical data and of its
corresponding SAX words (Figure 3(c) and (d)): the SAX words
clearly have a lower PE magnitude than the raw data. Thus, SAX
in practice reduces both the redundancy effect and the absolute
complexity.

[Figure 4: for each dataset, an "Info_loss" panel with curves
SAX_MSE and PAA_MSE and a "KL Divergence" panel with curves SAX_KL
and PAA_KL.]

Fig. 4. Information loss and KL divergence on (a) 'Coffee' dataset (top),
(b) 'ECG' dataset (bottom). X-axis denotes the sample index, Y-axis is the
corresponding measurement.

TABLE II
The IEC scores on the SAX/PAA representations and the classification
error rates on SAX words and the original data.

Dataset        SAX IEC   PAA IEC   Error Rate on SAX   Error Rate on Raw Data
lighting2      0.444     0.179     0.229               0.197
ECG            0.259     0.305     0.22                0.11
coffee         0.126     0.242     0.107               0.25
adiac          0.096     0.089     0.383               0.407
lighting7      0.496     0.065     0.397               0.37
beef           0.416     0.407     0.466               0.4
oliveoil       0.119     0.883     0.166               0.233
clinical ECG   0.192     0.611     0.173               0.545

B. Information Embedding Efficiency 

As the size of the SAX representation sometimes exerts a subtle
influence on the PE-based analysis, it is necessary to explore
other ways to study the information embedding efficiency.

SAX goes one step further than PAA by mapping the PAA values to
the corresponding symbols according to the standard Gaussian
distribution. If SAX can encode more information in the SAX words
while incurring less information loss, we want to determine
whether this advantage comes from the PAA process or from the
Gaussian mapping process. However, neither KL divergence nor
information loss alone can capture the non-linear dependencies
across different datasets; the comparison has to proceed case by
case. On some datasets, it is easy to discriminate the benefit of
the SAX words over the PAA output (Figure 4(a)). On others, the
relationship is ambiguous (Figure 4(b)). If a representation has
a small KL divergence together with a comparatively large
information loss, it retains the substantial information while
discarding much of the noise. Thus, to compare the information
embedding efficiency of different symbolic approaches, we propose
the new IEC score to measure how much information is discarded
when compressing the original signals. We take the average of the
IEC scores over all samples in each dataset and compare the
classification performance of SAX [?] and the original time
series [?] using a 1NN classifier (Table II). We expect
representations with a lower IEC score to achieve lower
classification error rates.


Table II shows that SAX words have lower IEC scores than the PAA
values on four of the eight datasets (ECG, coffee, oliveoil and
clinical ECG). On these datasets, the Gaussian mapping procedure
improves the representational capability by getting rid of much
noise while keeping the output symbols closer to the original
signals. Another observation is that when SAX words have a high
IEC score (e.g. on 'lighting2' and 'lighting7'), the
classification performance on SAX words is worse than on the raw
data. We further expect that the classification performance on
PAA values would overtake SAX on the 'lighting2' and 'lighting7'
datasets, because the PAA values have clearly lower IEC scores
than the SAX words there. On the clinical ECG dataset, the KL
divergence, reconstruction information loss and IEC score of the
SAX words and PAA output support our assumptions. We rank the
three measurements in ascending order (Figure 5). Although it is
hard to judge the performance from the graphs of KL divergence
and information loss, the SAX words clearly have a lower average
IEC score over all samples (0.1919) than the PAA output (0.6105).
The low IEC score of the SAX words also explains their superior
classification performance compared with the raw data (Table
II).
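The comparisons discussed for Table II can be reproduced directly from the tabulated values. The sketch below only re-reads the numbers reported in Table II; the variable names are ours.

```python
# (SAX IEC, PAA IEC, error rate on SAX, error rate on raw data), from Table II.
TABLE_II = {
    "lighting2": (0.444, 0.179, 0.229, 0.197),
    "ECG": (0.259, 0.305, 0.22, 0.11),
    "coffee": (0.126, 0.242, 0.107, 0.25),
    "adiac": (0.096, 0.089, 0.383, 0.407),
    "lighting7": (0.496, 0.065, 0.397, 0.37),
    "beef": (0.416, 0.407, 0.466, 0.4),
    "oliveoil": (0.119, 0.883, 0.166, 0.233),
    "clinical ECG": (0.192, 0.611, 0.173, 0.545),
}

# Datasets where SAX has the lower IEC score than PAA.
sax_lower_iec = [d for d, (s, p, _, _) in TABLE_II.items() if s < p]

# Datasets where classifying SAX words beats classifying the raw data.
sax_beats_raw = [d for d, (_, _, e_sax, e_raw) in TABLE_II.items() if e_sax < e_raw]

highest_iec_among_winners = max(TABLE_II[d][0] for d in sax_beats_raw)
lowest_iec_among_others = min(
    v[0] for d, v in TABLE_II.items() if d not in sax_beats_raw
)

print(sax_lower_iec)
print(sax_beats_raw)
print(highest_iec_among_winners, lowest_iec_among_others)
```

On these eight datasets, every case in which SAX beats the raw data has a SAX IEC score of at most 0.192, while every other case has a score of at least 0.259, so the two groups are separated by the SAX IEC score alone.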

Fig. 5. The KL divergence, reconstruction information loss and IEC scores
of the SAX words and PAA output on the clinical ECG data.

TABLE III
Regression statistics of the classification error ratio and the IEC
scores on seven benchmark datasets and one real-world clinical dataset.
Independent variable X is the IEC score, dependent variable y is the
error ratio.

Regression Statistics
Multiple R    0.962576
R Square      0.926553
Intercept     0
X             9.454057
X^2           -14.9802

TABLE IV
The absolute mean value of the ACF on raw data, SAX words and PAA
output.

Dataset        RAW      SAX      PAA
lighting2      0.3502   0.2336   0.2112
ECG200         0.1981   0.2907   0.2787
Coffee         0.4723   0.3237   0.3227
Adiac          0.4656   0.3237   0.3227
lighting7      0.2455   0.2868   0.2704
Beef           0.4040   0.2722   0.2599
Oliveoil       0.3226   0.1763   0.1680
Clinical ECG   0.6331   0.5563   0.3605

To better understand the non-linear correlation between the IEC
scores and the representational capability for time series data
mining (i.e. classification performance), we applied a
higher-order regression model. Our assumption is that the IEC
score helps us to find out whether a symbolic representation
outperforms the raw data on a specific dataset. We suppose that
the improvement in classification error rate of the SAX words
over the original signal, expressed as an error ratio, is
correlated with the corresponding IEC score. By quadratic
regression, we found that the error ratio and the IEC scores are
highly correlated, with an R-square of 0.9265 (Table III). Thus,
we suggest that when the IEC score of a symbolic representation
is lower than 0.2, within some tolerance, the representation is
more likely to outperform the original signals in data mining
tasks such as classification or retrieval.
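A quadratic regression with the intercept fixed at zero, the model form reported in Table III, can be fitted by solving the 2 x 2 normal equations. The sketch below uses synthetic points for illustration only; it does not reproduce the paper's fit, and the function names are ours.

```python
def fit_quadratic_through_origin(xs, ys):
    """Least-squares coefficients (b1, b2) of y = b1 * x + b2 * x**2."""
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    det = s2 * s4 - s3 * s3
    if det == 0:
        raise ValueError("need at least two distinct non-zero x values")
    b1 = (t1 * s4 - t2 * s3) / det
    b2 = (s2 * t2 - s3 * t1) / det
    return b1, b2


def predict(x, b1, b2):
    """Evaluate the fitted curve at x."""
    return b1 * x + b2 * x * x


# Synthetic data generated from y = 2x - 3x^2 (not the paper's measurements).
xs = [0.1, 0.2, 0.3, 0.4]
ys = [2 * x - 3 * x * x for x in xs]
b1, b2 = fit_quadratic_through_origin(xs, ys)
print(b1, b2, predict(0.25, b1, b2))
```

With noise-free synthetic data the fit recovers the generating coefficients; applied to the per-dataset IEC scores and error ratios, the same procedure would yield coefficients of the kind listed in Table III.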


C. Internal Correlation/Seasonality 

The internal correlation indicates whether the representation
preserves the significant seasonal patterns with strong
correlation in the data while removing much of the redundant
correlation. A good representation should smooth the noise,
reduce the volume and simultaneously extract the main patterns.
We apply ACF and PACF to evaluate how well SAX words and PAA
output capture the intrinsic correlation embedded in time series.
In our experiments, both PAA and SAX extract the correlation
tendency of the original signals: the ACF/PACF plots have the
same shape for PAA output and SAX words (Figure 6). The maximum
peak value of the ACF measures the periodicity in a sequence of
signals [?]. We therefore look at the absolute mean of the ACF
values, averaged over all samples in each dataset (Table IV). SAX
words consistently have a higher absolute mean ACF than the PAA
output, and thus preserve more internal correlation.

We also explored the morphology of the internal correlation
embedded in time series by the SAX words and PAA output on the
real-world clinical dataset (Figure 6(b)). SAX words and PAA
output show a similar ACF pattern, which preserves the internal
correlation. The PACF, in contrast, behaves quite interestingly.
The average ACF values of the original signals, SAX words and PAA
output are 0.587, 0.556 and 0.360, respectively. SAX and PAA
both decrease the correlation redundancy and generate more
compact representations. The smoothing of the redundancy effect
by SAX words is also explained by the fact that SAX compresses
the internal correlation.

Fig. 6. (a) The ACF/PACF plots of raw data, SAX words and PAA output
on the benchmark 'ECG' dataset. SAX words and PAA output both preserve
the correlation while compressing the original signals. The absolute mean of
the ACF value over all samples is 0.35 on the raw signals, 0.23 on the SAX
words and 0.21 on PAA values (left). (b) The ACF/PACF plots on the clinical
dataset (right).

While the ACF indicates that both SAX and PAA extract the
internal correlation, the PACF demonstrates that the two methods
do not follow the same pattern but have different recurrence
relations. In time series analysis, the order of an
autoregressive (AR) process is decided from the time lags with
significant coefficients, because the insignificant coefficients
do not exhibit any pattern. Although the test time series are
clearly not stationary and most of them have a seasonal trend,
the PACF shows complex behavior on SAX words and PAA values.
Interestingly, the peak values of the PACF on SAX words and PAA
output sometimes have opposite signs, which means that the
direction of the correlation is reversed by the Gaussian mapping
procedure.

V. Conclusion 

To the best of our knowledge, this paper is the first attempt to
empirically explore the intrinsic statistical properties of
Symbolic Aggregation Approximation. We explain the statistical
behavior of SAX words in three aspects: the complexity, the
information embedding efficiency and the internal correlation. We
also propose a new measurement, the Information Embedding Cost
(IEC), to evaluate the information embedding efficiency of the
symbolic dynamics. We show that the IEC score is highly
correlated with the classification performance.

The line of reasoning is as follows. The lower absolute
permutation entropy implies that SAX words significantly reduce
the complexity of the original time series. This observation
motivates us to investigate how much information the symbolic
representation retains and discards after processing by SAX. We
develop a novel measurement, the IEC score, based on KL
divergence and reconstruction information loss, to evaluate the
information embedding efficiency. Our experiments indicate that
SAX improves significantly on PAA where it attains a lower IEC
score. We recommend an IEC score of 0.2 as the criterion when
considering whether the SAX representation works better. If a
representation has an IEC score lower than 0.2, SAX is likely to
improve the classification performance compared with the original
signals. Such an analysis of the IEC score can also be
generalized to other symbolic representations such as ACA,
Persist or newly proposed representation methods.

The redundancy effect appears in the analysis of permutation
entropy. This observation motivates us to investigate how SAX
behaves with respect to the internal correlation embedded in the
original time series. Our experiments demonstrate that the
internal correlation measured by the ACF is preserved by both SAX
and PAA precisely and concisely. However, a moving average (MA)
model will not fit the output signal well, because the ACF
indicates seasonal properties before differencing. The same
trends preserved by the ACF together with lower average ACF
values imply that PAA-based discretization reduces the redundant
internal correlation and maintains the major correlations. SAX
words always have a slightly higher average ACF value than PAA,
and thus better maintain the internal correlations. The PACF
behavior differs considerably between SAX and PAA, which implies
that the Gaussian mapping procedure greatly changes the
conditional correlation among the original data. We will try to
reveal the symbolic mechanics behind this interesting observation
in future work.

Our work supports the SAX method and its further applications,
provides an analytical framework and statistical tools to
evaluate symbolic representation methods, and may inspire
researchers to design new symbolic dynamics for time series data
mining tasks.

References 

[1] R. Agrawal, C. Faloutsos, and A. Swami, Efficient similarity search in 
sequence databases. Springer, 1993. 

[2] A. Grinsted, J. C. Moore, and S. Jevrejeva, ‘Application of the cross 
wavelet transform and wavelet coherence to geophysical time series,” 
Nonlinear processes in geophysics, vol. 11, no. 5/6, pp. 561-566, 2004. 

[3] K. Lau and H. Weng, “Climate signal detection using wavelet transform: 
How to make a time series sing,” Bulletin of the American Meteorological 
Society, vol. 76, no. 12, pp. 2391-2402, 1995. 

[4] P. C. Ivanov, M. G. Rosenblum, C. Peng, J. Mietus, S. Havlin, H. Stanley, 
and A. L. Goldberger, “Scaling behaviour of heartbeat intervals obtained 
by wavelet-based time-series analysis,” Nature, vol. 383, no. 6598, pp. 
323-327, 1996. 

[5] F. Korn, H. V. Jagadish, and C. Faloutsos, “Efficiently supporting ad 
hoc queries in large datasets of time sequences,” ACM SIGMOD Record, 
vol. 26, no. 2, pp. 289-300, 1997. 

[6] K. Ravi Kanth, D. Agrawal, and A. Singh, “Dimensionality reduction for 
similarity searching in dynamic databases,” in ACM SIGMOD Record, 
vol. 27, no. 2. ACM, 1998, pp. 166-176. 

[7] P. Esling and C. Agon, “Time-series data mining,” ACM Computing 
Surveys (CSUR), vol. 45, no. 1, p. 12, 2012. 

[8] F. Zhou, F. Torre, and J. K. Hodgins, “Aligned cluster analysis for 
temporal segmentation of human motion,” in Automatic Face & Gesture 
Recognition, 2008. 8th IEEE International Conference on. IEEE, 2008, 
pp. 1-7. 

[9] F. Morchen and A. Ultsch, “Finding persisting states for knowledge 
discovery in time series,” in From Data and Information Analysis to 
Knowledge Engineering. Springer, 2006, pp. 278-285. 


[10] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Dimensionality 
reduction for fast similarity search in large time series databases,” 
Knowledge and Information Systems, vol. 3, no. 3, pp. 263-286, 2001. 

[11] J. Lin, E. Keogh, S. Lonardi, and B. Chiu, “A symbolic representation of 
time series, with implications for streaming algorithms,” in Proceedings 
of the 8th ACM SIGMOD workshop on Research issues in data mining 
and knowledge discovery. ACM, 2003, pp. 2-11. 

[12] A. Sant’Anna and N. Wickstrom, “Symbolization of time-series: An 
evaluation of SAX, Persist, and ACA,” in Image and Signal Processing 
(CISP), 2011 4th International Congress on, vol. 4. IEEE, 2011, pp. 
2223-2228. 

[13] E. Keogh, J. Lin, and A. Fu, “HOT SAX: Efficiently finding the most 
unusual time series subsequence,” in Data Mining, Fifth IEEE International 
Conference on. IEEE, 2005, 8 pp. 

[14] D. Yankov, E. Keogh, J. Medina, B. Chiu, and V. Zordan, “Detecting 
time series motifs under uniform scaling,” in Proceedings of the 13th 
ACM SIGKDD international conference on Knowledge discovery and 
data mining. ACM, 2007, pp. 844-853. 

[15] P. Senin and S. Malinchik, “SAX-VSM: Interpretable time series 
classification using SAX and vector space model,” in Data Mining (ICDM), 2013 
IEEE 13th International Conference on. IEEE, 2013, pp. 1175-1180. 

[16] T. Oates, C. F. Mackenzie, L. G. Stansbury, B. Aarabi, D. M. Stein, 
and P. F. Hu, “Predicting patient outcomes from a few hours of high 
resolution vital signs data,” in Machine Learning and Applications 
(ICMLA), 2012 11th International Conference on, vol. 2. IEEE, 2012, 
pp. 192-197. 

[17] T. Oates, C. F. Mackenzie, D. M. Stein, L. G. Stansbury, J. Dubose, 
B. Aarabi, and P. F. Hu, “Exploiting representational diversity for time 
series classification,” in Machine Learning and Applications (ICMLA), 
2012 11th International Conference on, vol. 2. IEEE, 2012, pp. 538-544. 

[18] Z. Wang and T. Oates, “Time warping symbolic aggregation approximation 
with bag-of-patterns representation for time series classification,” in 
Machine Learning and Applications (ICMLA), 2014 13th International 
Conference on. IEEE, 2014, pp. 270-275. 

[19] C. Bandt and B. Pompe, “Permutation entropy: a natural complexity 
measure for time series,” Physical review letters, vol. 88, no. 17, p. 
174102, 2002. 

[20] S. Kullback and R. A. Leibler, “On information and sufficiency,” The 
annals of mathematical statistics, pp. 79-86, 1951. 

[21] W. W.-S. Wei, Time series analysis. Addison-Wesley publ, 1994. 

[22] E. Keogh, X. Xi, L. Wei, and C. A. Ratanamahatana, “The UCR time 
series classification/clustering homepage,” URL: 
http://www.cs.ucr.edu/~eamonn/time_series_data, 2006. 

[23] M. Riedl, A. Muller, and N. Wessel, “Practical considerations of 
permutation entropy,” The European Physical Journal Special Topics, 
vol. 222, no. 2, pp. 249-262, 2013. 

[24] J. Amigo, S. Zambrano, and M. A. Sanjuan, “Combinatorial detection 
of determinism in noisy time series,” EPL (Europhysics Letters), vol. 83, 
no. 6, p. 60005, 2008. 

[25] C. Bian, C. Qin, Q. D. Ma, and Q. Shen, “Modified permutation-entropy 
analysis of heartbeat dynamics,” Physical Review E, vol. 85, no. 2, p. 
021906, 2012. 

[26] L. De Micco, J. G. Fernandez, H. A. Larrondo, A. Plastino, and O. A. 
Rosso, “Sampling period, statistical complexity, and chaotic attractors,” 
Physica A: Statistical Mechanics and its Applications, vol. 391, no. 8, 
pp. 2564-2575, 2012. 

[27] R. Bellman, Adaptive control processes: a guided tour. Princeton 
University Press, 1961, vol. 4. 



